postQTL: a QTL mapping R workflow to improve the accuracy of true positive loci identification

Objective The determination of the location of quantitative trait loci (QTL) (i.e., QTL mapping) is essential for identifying new genes. Various statistical methods are being incorporated into different QTL mapping functions. However, statistical errors and limitations may often occur in a QTL mapping, implying the risk of false positive errors and/or failing to detect a true positive QTL effect. We simulated the power to detect four simulated QTL in tomato using cim() and stepwiseqtl(), widely adopted QTL mapping functions, and QTL.gCIMapping(), a derivative of the composite interval mapping method. While there is general agreement that those three functions identified simulated QTL, missing or false positive QTL were observed, which were prevalent when more realistic data (such as smaller population size) were provided. Results To address this issue, we developed postQTL, a QTL mapping R workflow that incorporates (i) both cim() and stepwiseqtl(), (ii) widely used R packages developed for model selection, and (iii) automation to increase the accuracy, efficiency, and accessibility of QTL mapping. QTL mapping experiments on tomato F2 populations in which QTL effects were simulated or calculated showed advantages of postQTL in QTL detection. Supplementary Information The online version contains supplementary material available at 10.1186/s13104-022-06017-z.


Introduction
Quantitative trait loci (QTL) affect the phenotypic variation observed in quantitative characters. Quantitative traits (character) display a continuous distribution of phenotypes, and many agronomically important traits (e.g., yield) belong to quantitative traits. Therefore, QTL mapping, the determination of the location of QTL, is one of the first steps to build a scientific basis for plant genetics and breeding.
Broman and Speed [14] stated that the key problem with CIM is the choice of the set of markers to use as regressors, inevitably affecting the power for QTL detection (e.g., increasing the variance of the logarithm of the odds (LOD) score). Likewise, QTL mapping is a model selection problem closely concordant with controlling the inclusion of extraneous loci while identifying as many true QTL as possible [14,15]. While an exhaustive model search would be the ideal solution to the problem,

BMC Research Notes
*Correspondence: tonggeonlee@ufl.edu 1 Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA Full list of author information is available at the end of the article dramatically increasing computational demands by consolidating different models/large genotypic datasets are less clearly advantageous to non-specialist researchers such as plant breeders. In QTL mapping, forward selection or some version of stepwise selection is commonly used because of its computational efficiency and a close approximation of the best subset selection [14][15][16].
Recently, an R/qtl function stepwiseqtl() with a forward/ backward stepwise search algorithm was implemented in the R/qtl package (note stepwiseqtl() was added to R/qtl version 1.09 and later [15]).
Regularization methods (e.g., minimax concave penalty (MCP) [17], least absolute shrinkage and selection operator (LASSO) [18]), have been used in statistics and machine learning for feature selection (e.g., adopted in genomic prediction [19,20]). Practical implications of such methods for QTL mapping would help us to select a true QTL by distinguishing a true QTL effect from extraneous loci-driven noise (e.g., regression coefficients for a given marker shrink to near 0 where a model fits).
Given the above information, a comprehensive approach integrating the results from both mapping functions, cim() (note that a version of the CIM method is implemented into the R/qtl) and stepwiseqtl(), and regularization method(s) may be as good as a single function-driven QTL mapping, or an optimal strategy to improve the accuracy of true positive QTL identification. Here, we present postQTL, a QTL mapping R workflow that incorporates (i) both cim() and stepwiseqtl(), (ii) widely used R packages developed for model selection, and (iii) automation to increase the accuracy, efficiency, and accessibility of QTL mapping. postQTL is an R script that executes several R function()s and R packages connected by the R language. This package has been developed to be straightforward for use in a typical R environment on Linux/PC hardware. We focus on the tomato, a model fruit crop with relatively large QTL mapping efforts.

Simulating QTL mapping data
In this study, we used the R environment (version 4.1.1; [21]) with the R/qtl package (version 1.50; [22]) on both a stand-alone Windows operating PC and the UF/Research Computing Linux server, HiPerGator 3.0 [23]. A genetic map created for tomato (Solanum lycopersicum) [24] was used to simulate two different sized F 2 populations using the R/qtl function sim.cross(): F 2 individuals with 1000 (hereafter referred to as population #1) and 100 (hereafter referred to as population #2) to mimic mapping populations or breeding studies. Four simulated QTL were positioned at 5 cM on each of chromosomes 1, 5, 7, and 12, and simulated to have additive effect sizes of 1.0, 0.5, 0.2, and 0.2, respectively (Additional file 1). The missing data percentage was set at 50% and an error rate of 1e −4 was used. Three QTL mapping functions, cim(), QTL. gCIMapping() (version 3.3.1; an R function implemented into the GCIM method [25]), and stepwiseqtl() of the R/ qtl package, were used to map simulated QTL in both populations #1 and #2. For cim(), covariates 3 and 11 were chosen. For each mapping function, 100 iterations were performed and the identified QTL were reported.

postQTL
The postQTL R workflow developed for this project consists of six steps ( Fig. 1; Additional file 2). The first step of the workflow loads the postQTL R script and R packages. Recent versions of commonly used R packages were used for postQTL (the full list of R packages required is available at Additional file 3). The postQTL R script executes four function()s, map_qtl(), model_qtl(), regu-larize_qtl(), and model_chromosome(). Before executing the script, the user loads an input dataset (a single CSV file that carries the genotypic and phenotypic datasets is available at Additional file 1). After loading, postQTL simultaneously uses cim() with a covariate 11 and stepwiseqtl() to identify QTL (here performed by a function map_qtl()). postQTL performs 10 iterations for cim() and a single run for stepwiseqtl(), and then merges the identified QTL. After identifying QTL, postQTL performs an exhaustive model search for markers residing in the identified QTL using an R function regsubsets() implemented in the R package 'leaps' [26] (performed by model_qtl()), providing information about whether or not identified QTL are included in the best model. Such a model search restriction over the identified QTL reduces computational demands. For the QTL identified by stepwiseqtl(), a function find.marker() was used to find the nearest marker to the QTL. For the QTL by cim(), both a marker at the QTL peak position and one of two markers flanking the QTL peak were used. The user determines the best model(s) once a list of models with ranks is created by postQTL [e.g., choose marker(s) with top Cp value(s); note that postQTL outputs BIC and adjusted R 2 using an R package 'leaps' as optional additional resources for users). Next, postQTL calculates the regression coefficient (β) for the representative marker(s) in the identified QTL using regularization methods LASSO and MCP (performed by regularize_qtl()). LASSO and MCP were implemented using the R packages 'glmnet' (version 4.1.3; [27]) and 'picasso' (version 1.3.1; [28]), respectively. Finally, postQTL performs an exhaustive model search with a fixed model size (a default value of 3 was set) for each chromosome (e.g., all markers reside on chromosome 1) to identify best predictors (performed by model_chromosome()). Theoretically, a true QTL is more likely to be physically located near or at those predictors. The user can modify arguments listed in postQTL (e.g., chromosome, numberofpredictors) for their QTL mapping project (e.g., a large number of accurate markers vs. a few selected markers). postQTL was tested on a standalone Windows operating PC via an RStudio [29] and Linux environment [23] via a terminal emulator.

QTL mapping of tomato height
All mapping methods (cim(), QTL.gCIMapping(), and stepwiseqtl()) and postQTL were applied to continuous phenotypic data for the tomato seedling trait. 116 F 2 plants randomly selected from a segregating progeny in a population, which is known to show at least two QTL (genes) including the tomato BRACHYTIC locus [30] for height based on field evaluations, were chosen. The genotypic (F 2 generation genotyped by the tomato Illumina Infinium array and a molecular marker for the tomato BRACHYTIC locus; Additional file 4) and phenotypic (F 2 , F 2:3 , and F 2:4 ; Additional file 5) datasets were prepared as described in our previous study [30].

Results
We simulated four QTL with different effect sizes on four different chromosomes. We compared the QTL mapping results among cim(), QTL.gCIMapping(), and stepwiseqtl() on two differentially sized populations. In simulated population #1 (1000 F 2 individuals), cim() exhibited the highest frequency of the identification of all four simulated QTL followed by stepwiseqtl() (LOD score > 3.0; Fig. 2A). However, in simulated population #2 (100 F 2 individuals), a lower accuracy of the identification of simulated QTL was observed for all three mapping functions (Fig. 2B). Both QTL.gCIMapping() and stepwiseqtl() missed at least one of simulated QTL, while cim() identified all four simulated QTL in at least 8% of 100 iterations. Expectedly, the overall frequency of identifying potential false positive QTL in population #2 was higher than in population #1 ( Fig. 2C and D). Therefore, in the following sections, we demonstrate how postQTL performed a QTL mapping and its post-analysis on population #2.
Subsequently, postQTL provided a comprehensive approach integrating the results from an exhaustive model search for identified QTL and regression coefficients, which could provide a reasonable guide for QTL selection. In terms of exhaustive model search, postQTL identified markers located in six identified QTL, except for Q3 (column named 'Model search' in Table 1). All four simulated QTL (Q1, Q5, Q7, and Q12) were captured by the top models. In the estimated regression coefficient test, relatively high deviations from the β value 0 (i.e., not shrink to near 0) for both LASSO and MCP were observed from at least one of the markers representing the simulated QTL (columns named 'Regression coefficients' in Table 1). For Q2 and Q4, either an exhaustive model search failed to produce identical markers to representative marker(s) listed in the QTL, or one of two regression coefficient tests for the representative marker(s) shrunk to near 0. In the case of Q3, the exhaustive model search failed to produce any marker(s) on # Load postQTL. library(postQTL) # Load data. myqtldata <-qtl::read.cross (format=c("csvr"), file="qtl_data.csv",genotypes=c("a","b","H")) # Run a QTL mapping. map_qtl(myqtldata) # Run a model search for markers residing in the identified QTL. model_qtl("inputformodelqtl.csv") # Calculate the regression coefficient for marker(s) in the identified QTL using regularization methods. regularize_qtl("inputforregularizeqtl.csv") # Run a model search for each chromosome. model_chromosome(myqtldata,chromosome,numberofpredictors) # end Fig. 1 An overview of the postQTL R workflow that chromosome. Additionally, the estimated regression coefficient for the marker SL4.0ch03.3639459 in Q3 was close to 0.00, which gives further information to exclude Q3 from a list of potential true positive QTL.
Finally, postQTL identified the best prediction markers per chromosome by running a model search with a fixed model size of 3 (column named 'Marker prediction' in Table 1). Two markers, SL4.0ch01.2691495 and SL4.0ch12.7161, were located in the simulated QTL, Q1 and Q12, respectively. Notably, detection of the marker SL4.0ch12.7161, located in one of the small effect QTL, provided further information to control the inclusion rate of small effect QTL.
To validate that postQTL can be useful to identify potential QTL associated with the plant height trait, we performed genetic mapping analyses using postQTL and previously developed mapping functions, cim(), QTL. gCIMapping(), and stepwiseqtl(). Three QTL signals with significant effects (LOD > 4) were detected from at least two filial generations by at least two analyses: a position near the BRACHYTIC fine-mapped on chromosome 1 [30], another position (close to 5.0 Mbp) on chromosome 1 and an approximate 60.0 Mbp on chromosome 7 (Additional file 6). Importantly, postQTL exhibited the QTL signal on chromosome 7 from all three filial generations (F 2 , F 2:3 , and F 2:4 ), which indicates this location can be used to investigate desirable alleles that might explain the phenotypic segregation in the tomato population used in this study.

Discussion
Despite the availability of multiple methods for QTL mapping, a need exists for a comprehensive approach integrating the results from multiple QTL mapping methods, which may be the optimal strategy to most accurately identify QTL. We developed postQTL, an R workflow that implements two widely adopted QTL mapping functions. We used postQTL primarily in a tomato community, as tested on F 2 and other advanced populations. However, postQTL should apply to any (plant) species as long as the QTL mapping functions, cim() and stepwiseqtl(), fit the species of user interest. In the mapping of low genetic effect QTL, missing such QTL is likely to be observed when researchers repeat the mapping analysis with independent imputations. To address this, postQTL has the default number of iterations for cim() as 10.
A critical element of any QTL mapping workflow is its ease of use. postQTL is suitable for both R environment novices and experienced R users. postQTL automates the entire QTL mapping process by requiring only one R workflow. Further, postQTL only requires commonly used R packages in the R program, not requiring additional processing steps outside of the workflow.
Lastly, postQTL includes regularization methods which could be useful supplements to the researcher